
Conversation

@Matt711 Matt711 commented Apr 7, 2025

Description

Contributes to #16483.

This PR adds a new libcudf API: cudf::table_to_array, which copies data from a table_view into a preallocated column-major device array using cub::DeviceMemcpy::Batched.

The primary use case for this API is to accelerate the conversion of a cudf.DataFrame to a CuPy array when users access DataFrame.values in Python.

In a follow-up PR, I'll integrate this API into the cudf Python layer.
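To make the layout concrete, here is a minimal host-side sketch (NumPy standing in for device memory; `table_to_array_sketch` is a hypothetical helper, not the libcudf API) of what "copy a table into a preallocated column-major array" means: each equal-dtype column becomes one contiguous copy into its slice of the flat output buffer.

```python
import numpy as np

# Hypothetical host-side stand-in for cudf::table_to_array: copy equal-dtype
# columns into a caller-preallocated flat buffer in column-major order.
def table_to_array_sketch(columns, out):
    num_rows = len(columns[0])
    # Column-major: column j occupies out[j*num_rows : (j+1)*num_rows], so each
    # column is a single contiguous copy (the batched-memcpy unit on the GPU).
    for j, col in enumerate(columns):
        out[j * num_rows:(j + 1) * num_rows] = col
    # View the same storage as a Fortran-ordered 2-D array.
    return out.reshape(num_rows, len(columns), order="F")

cols = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
buf = np.empty(6, dtype=np.float64)  # preallocated by the caller, as the API requires
arr = table_to_array_sketch(cols, buf)
# arr[:, 0] is the first column, arr[:, 1] the second; arr shares buf's storage.
```

In the real API the output span lives on the device and the per-column copies are issued as one `cub::DeviceMemcpy::Batched` call rather than a loop.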

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Apr 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 7, 2025
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Apr 7, 2025

Matt711 commented Apr 7, 2025

/ok to test

@Matt711 Matt711 marked this pull request as ready for review April 8, 2025 03:51
@Matt711 Matt711 requested review from a team as code owners April 8, 2025 03:51
@Matt711 Matt711 requested review from bdice and vuule April 8, 2025 03:51
@bdice bdice left a comment


Initial feedback attached. Thanks for your work on this!


Matt711 commented Apr 8, 2025

Benchmark Results

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 262144 | 2 | 24544x | 24.381 us | 20.36% | 20.378 us | 5.97% | 205.824 GB/s | 21.44% |
| 2097152 | 2 | 8656x | 61.408 us | 6.84% | 57.816 us | 2.96% | 580.363 GB/s | 60.45% |
| 16777216 | 2 | 1424x | 357.499 us | 1.32% | 353.912 us | 0.85% | 758.480 GB/s | 79.00% |
| 32768 | 10 | 25488x | 23.607 us | 21.45% | 19.629 us | 6.31% | 133.547 GB/s | 13.91% |
| 262144 | 10 | 11792x | 46.189 us | 9.27% | 42.504 us | 3.45% | 493.399 GB/s | 51.39% |
| 2097152 | 10 | 2192x | 233.055 us | 1.91% | 229.399 us | 1.06% | 731.356 GB/s | 76.18% |
| 16777216 | 10 | 544x | 1.666 ms | 1.23% | 1.663 ms | 1.21% | 807.174 GB/s | 84.07% |
| 32768 | 100 | 9744x | 55.100 us | 8.09% | 51.326 us | 3.50% | 510.739 GB/s | 53.20% |
| 262144 | 100 | 1744x | 290.759 us | 1.57% | 287.065 us | 0.91% | 730.551 GB/s | 76.09% |
| 2097152 | 100 | 352x | 2.059 ms | 1.30% | 2.056 ms | 1.29% | 816.148 GB/s | 85.01% |
| 16777216 | 100 | 32x | 16.006 ms | 0.26% | 16.002 ms | 0.25% | 838.756 GB/s | 87.36% |

[1] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 32768 | 2 | 15344x | 36.213 us | 13.56% | 32.615 us | 8.08% | 16.075 GB/s | 1.67% |
| 262144 | 2 | 14064x | 39.508 us | 12.78% | 35.582 us | 6.11% | 117.878 GB/s | 12.28% |
| 2097152 | 2 | 6496x | 80.498 us | 4.96% | 77.064 us | 2.18% | 435.412 GB/s | 45.35% |
| 16777216 | 2 | 1648x | 546.025 us | 1.37% | 542.291 us | 1.18% | 495.003 GB/s | 51.56% |
| 32768 | 10 | 13840x | 39.317 us | 9.62% | 36.142 us | 4.01% | 72.532 GB/s | 7.55% |
| 262144 | 10 | 8544x | 61.967 us | 6.53% | 58.526 us | 2.85% | 358.326 GB/s | 37.32% |
| 2097152 | 10 | 2064x | 246.372 us | 2.02% | 242.862 us | 1.07% | 690.814 GB/s | 71.95% |
| 16777216 | 10 | 928x | 1.877 ms | 2.02% | 1.873 ms | 2.01% | 716.410 GB/s | 74.62% |
| 32768 | 100 | 7632x | 68.997 us | 5.91% | 65.532 us | 2.68% | 400.023 GB/s | 41.66% |
| 262144 | 100 | 1680x | 303.596 us | 1.48% | 300.142 us | 0.93% | 698.719 GB/s | 72.78% |
| 2097152 | 100 | 576x | 2.117 ms | 0.80% | 2.114 ms | 0.78% | 793.753 GB/s | 82.67% |
| 16777216 | 100 | 31x | 16.148 ms | 0.26% | 16.144 ms | 0.26% | 831.395 GB/s | 86.60% |

@Matt711 Matt711 requested review from bdice and vuule April 21, 2025 15:00
@Matt711 Matt711 added this to libcudf Apr 22, 2025
@Matt711 Matt711 moved this to In progress in libcudf Apr 22, 2025

ttnghia commented Apr 24, 2025

What is the common functionality between this and chunk_pack (#13260)? If they are similar, can the two be refactored to share implementation?


@bdice bdice left a comment


A few comments -- overall this is close!


bdice commented Apr 25, 2025

> What is the common functionality between this and chunk_pack (#13260)? If they are similar, can the two be refactored to share implementation?

I looked over the splitting/copying APIs. This is similar to cudf::pack but does not serialize any table metadata or null masks. It's just the raw contents of the columns (which must all have the same type), concatenated into an existing user-provided GPU buffer. That last part (an existing buffer) is crucial because we want to hand off memory ownership of the results to CuPy or PyTorch without incurring a copy or tying the lifetime of the results to a specific RMM memory resource. I don't see any obvious overlap here, but I also haven't read the source code of those APIs to know if there are common paths.
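The ownership distinction above can be sketched in a few lines (hypothetical names, NumPy host memory standing in for GPU memory): a pack-style API allocates and returns a new buffer, while a table_to_array-style API fills storage the caller already owns, so the result's lifetime is not tied to the producer's allocator.

```python
import numpy as np

def pack_style(columns):
    # Allocates its own output; the caller must adopt the allocation or copy it.
    return np.concatenate(columns)

def table_to_array_style(columns, out):
    # Writes into caller-owned storage: no new allocation, no ownership transfer.
    np.concatenate(columns, out=out)

cols = [np.array([1, 2]), np.array([3, 4])]
owned_by_caller = np.empty(4, dtype=cols[0].dtype)  # e.g. a CuPy/PyTorch tensor's buffer
table_to_array_style(cols, owned_by_caller)
```

This is also why the output buffer's allocator (CuPy's pool, PyTorch's caching allocator, or RMM) never needs to be visible to the copy itself.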

@vuule vuule requested a review from bdice April 25, 2025 19:20

@vuule vuule left a comment


A few places where the code could be simplified; otherwise looks great!

@Matt711 Matt711 moved this to In Progress in cuDF Python May 1, 2025
@Matt711 Matt711 requested review from lamarrr and vuule May 5, 2025 19:45

@vuule vuule left a comment


Thank you for addressing the comments!
A few more things and it should be good to go.

```cpp
 */
void table_to_array(table_view const& input,
                    device_span<cuda::std::byte> output,
                    cudf::data_type output_dtype,
```

Could we avoid passing this and instead derive the type from the column types, since we throw on mismatch anyway?


Done b71ec3a

@Matt711 Matt711 requested a review from vuule May 6, 2025 17:31

@vuule vuule left a comment


🔥


Matt711 commented May 7, 2025

Thanks all!


Matt711 commented May 7, 2025

/merge

@rapids-bot rapids-bot bot merged commit 7e555e0 into rapidsai:branch-25.06 May 7, 2025
112 of 113 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python May 7, 2025
@GregoryKimball GregoryKimball removed this from libcudf May 8, 2025
rapids-bot bot pushed a commit that referenced this pull request May 15, 2025
Contributes to #16483 by adding fast paths to `DataFrame.to_cupy` (which is called when `DataFrame.values` is accessed). This PR follows up #18450, adding Cython bindings for `cudf::table_to_array` to pylibcudf and plumbing those changes through cudf classic.

I benchmarked the fast (True) and slow (False) paths when the dataframe has 1, 6, 20, and 100 columns. The fast paths use `cudf::table_to_array` if the number of columns is greater than 1, and `cp.asarray` directly if the dataframe has only one column. The slow path uses a [raw Python loop + assignment](https://github.com/rapidsai/cudf/blob/35d58394e7fb5a090ff3cda351403ec092476af5/python/cudf/cudf/core/frame.py#L520) to create the CuPy array.
![image](https://github.com/user-attachments/assets/4c9edfa0-e15d-4902-b597-675cfb02343d)

I used the median because the CUDA overhead of calling `cudf::table_to_array` is large (so there are outliers in the times). Here is a profile of calling `to_cupy` twice for both the slow and fast paths. 
![Screenshot from 2025-05-13 12-23-46](https://github.com/user-attachments/assets/d84fdfa3-3696-4df8-91b6-3eb9dde65430)

In the first calls, the fast path takes 7.3 ms vs. 4.8 ms for the slow path; the first call to `cudf::table_to_array` is the bottleneck. But comparing the second calls, the fast path is much faster (79 us vs. 2.3 ms).
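The dispatch described above can be sketched as follows (a hedged illustration, not the cudf source: NumPy stands in for CuPy, `to_array_sketch` and `to_array_slow` are hypothetical names, and the fast path's per-column copies stand in for the single batched device copy).

```python
import numpy as np

def to_array_sketch(columns):
    num_rows, num_cols = len(columns[0]), len(columns)
    if num_cols == 1:
        # Single column: a direct view is enough (cp.asarray in the real code).
        return np.asarray(columns[0]).reshape(num_rows, 1)
    # Fast path: fill a preallocated column-major array with one contiguous
    # copy per column (one batched cub::DeviceMemcpy call in libcudf).
    out = np.empty((num_rows, num_cols), order="F")
    flat = out.reshape(-1, order="F")  # view over the same storage
    for j, col in enumerate(columns):
        flat[j * num_rows:(j + 1) * num_rows] = col
    return out

def to_array_slow(columns):
    # The old path: a Python loop assigning into each output column.
    out = np.empty((len(columns[0]), len(columns)))
    for j, col in enumerate(columns):
        out[:, j] = col
    return out

cols = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
fast = to_array_sketch(cols)
slow = to_array_slow(cols)
```

Both paths produce the same array; the win on the GPU comes from replacing many small kernel-launch/assignment steps with one batched copy once the first-call overhead is paid.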

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #18801
